Finding meaningful patterns within data has become increasingly difficult as data collection and management continue to grow at an unprecedented rate
By the end of this presentation we will have discussed the following concepts of k-means clustering:
There are 5 main steps to execute the k-means clustering method
Clustering is the act of partitioning data into meaningful groups based on similarity of attributes
The goal of clustering is to create insightful clusters to better understand connections in the data
\[d(x,\mu_i)=\sqrt{\sum_{j=1}^{N} (x_j-\mu_{ij})^2}\]
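As a minimal sketch, the distance between one point and one centroid can be computed in a few lines of R. The helper name `euclidean_distance` and the toy values are assumptions made for illustration, not part of the original analysis.

```r
# Euclidean distance between a point and a centroid, both given as
# numeric vectors with one entry per feature (hypothetical helper).
euclidean_distance <- function(x, centroid) {
  stopifnot(length(x) == length(centroid))
  sqrt(sum((x - centroid)^2))
}

# Toy values chosen for illustration only
x <- c(1, 2)
centroid <- c(4, 6)
euclidean_distance(x, centroid)  # sqrt(3^2 + 4^2) = 5
```

A smaller returned value means the point is closer to, and so more similar to, that centroid.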
Objective Function:
It is formulated as:
\[ \text{WCSS}=\sum_{i=1}^{k}\sum_{x \in C_i}\|x-\mu_i\|^2 \]
\(k\) is the number of clusters
\(C_i\) represents the set of points in cluster \(i\)
\(\mu_i\) represents the centroid (mean) of cluster \(i\)
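To make the objective concrete, here is a hedged sketch that evaluates the within-cluster sum of squares by hand and compares it with the `tot.withinss` value reported by base R's `kmeans()`. The function name `wcss_manual` and the four toy points are assumptions for illustration.

```r
# Within-cluster sum of squares for a numeric matrix `points` (one row per
# observation) and a vector `cluster` of cluster labels (hypothetical helper).
wcss_manual <- function(points, cluster) {
  total <- 0
  for (i in unique(cluster)) {
    members <- points[cluster == i, , drop = FALSE]
    mu_i <- colMeans(members)                        # centroid of cluster i
    total <- total + sum(sweep(members, 2, mu_i)^2)  # squared distances to mu_i
  }
  total
}

# Toy data (made up): two well-separated groups of two points each
points <- rbind(c(1, 1), c(1, 3), c(9, 9), c(9, 11))
cluster <- c(1, 1, 2, 2)
wcss_manual(points, cluster)  # 4

# Compare against kmeans(), which starts from randomly chosen centroids
set.seed(1)
fit <- kmeans(points, centers = 2)
all.equal(wcss_manual(points, fit$cluster), fit$tot.withinss)
```

Each point in the toy data lies one unit from its cluster mean, so the four squared distances sum to 4.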
In this context, similarity is inversely related to the Euclidean distance
The smaller the distance, the greater the similarity between objects
K-means clustering reassigns each data point to the cluster with the nearest centroid, based on the Euclidean distance calculation
Each centroid is then moved to the mean of the points currently assigned to its cluster
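The assignment and update steps can be sketched as a single iteration in R. This is an illustrative version under stated assumptions, not the implementation used by `kmeans()`: the function name `kmeans_step`, the toy points, and the starting centroids are all hypothetical.

```r
# One k-means iteration: assign each point to its nearest centroid, then
# move each centroid to the mean of its assigned points.
kmeans_step <- function(points, centroids) {
  k <- nrow(centroids)
  # Squared Euclidean distance from every point (rows) to every centroid (columns)
  sq_dist <- sapply(seq_len(k), function(i) {
    rowSums(sweep(points, 2, centroids[i, ])^2)
  })
  sq_dist <- matrix(sq_dist, nrow = nrow(points))
  cluster <- max.col(-sq_dist, ties.method = "first")  # index of nearest centroid
  new_centroids <- centroids
  for (i in seq_len(k)) {
    members <- points[cluster == i, , drop = FALSE]
    if (nrow(members) > 0) {  # an empty cluster keeps its previous centroid
      new_centroids[i, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centroids = new_centroids)
}

# Toy data and starting centroids (made up for illustration)
points <- rbind(c(1, 1), c(1, 3), c(9, 9), c(9, 11))
centroids <- rbind(c(0, 0), c(10, 10))
step <- kmeans_step(points, centroids)
step$cluster    # 1 1 2 2
step$centroids  # rows (1, 2) and (9, 10)
```

Repeating this step until the assignments stop changing gives the basic iterative procedure. Leaving the centroid of an empty cluster in place is one simple design choice among several.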
library(ggplot2)
# `wcss` is assumed to be a numeric vector of length 10 computed earlier,
# holding the WCSS for K = 1 to 10
# Plot the WCSS values against the number of clusters
p1 <- ggplot(data.frame(K = 1:10, WCSS = wcss), aes(x = K, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method to Find Optimal K", x = "Number of Clusters (K)", y = "Within-Cluster-Sum-of-Squares (WCSS)") +
  scale_x_continuous(breaks = seq(1, 10, by = 1))
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
Challenges and Considerations -
Data Handling:
Managing large and noisy datasets
Robustness:
Ensuring robustness against outliers
Cluster Number Determination:
Defining an appropriate number of clusters
Research Focus -
Continued Exploration:
Ongoing refinement of clustering techniques and the cluster selection process
Industry Evolution:
Adapting newer methods to meet evolving e-commerce demands